Fix cartpole-camera frame-stacking performance regression by jmart-nv · Pull Request #5849 · isaac-sim/IsaacLab

jmart-nv · 2026-05-28T21:44:39Z

Description

Frame-stacking was added to cartpole-camera by #5574 to make the task solvable on the Newton+Warp backend. Single-frame RGB observations don't carry pole velocity, so a single-image policy can only recover ~100/300 mean episode reward. On PhysX+RTX, implicit damping in the dynamics plus temporal antialiasing in the renderer leak enough adjacent-frame correlation that a single-frame policy still hits ~296. Newton+Warp has neither, so explicit frame stacking is required for the task to be solvable on this backend.

The implementation regressed throughput by 42% (54.6 → 31.7 FPS) and inflated peak VRAM 2.7× (3.13 → 8.37 GB) at R256, 1024 envs. PR #5574 added a CircularBuffer ring + a stacked_image MDP term that:

Stored frames as float32 even when the camera output was uint8 → 4× ring memory and 4× shift bandwidth.
Ran the per-frame normalize (x/255 - mean(x)) before the buffer, forcing the buffer to be float32.
Re-rolled the buffer with torch.roll on every append → ring-sized temporary per env.step.
Built the channel-stacked policy obs via permute + reshape + clone → fresh (B, H, W, K*C) float32 tensor per env.step.
Inlined the normalize math at 5 call sites across DirectRLEnv and ManagerBasedEnv paths.

This PR rewrites the frame-stacking hot path; the result runs 24% faster than the pre-framestack baseline with identical PPO convergence.

Change	Why
uint8 ring buffer. Defer the `x/255 − mean` normalize past the frame-stack buffer for RGB-like data types.	Ring is 4× smaller; shift bandwidth per env.step drops 4×.
Free contiguous stacked output. New `CircularBuffer.stack_dim` arg + `.stacked` property arrange storage so the channel-stacked output is a contiguous view of internal storage.	Eliminates one `(B, H, W, K*C)`-sized allocation and a permute kernel per env.step.
Drop `torch.roll` on append. Replace with a front-to-back memmove.	One fewer ring-sized allocation and one fewer kernel launch per env.step.
Fused Warp normalize kernel (`normalize_image_uint8`). One kernel: `(uint8 / 255) − per-(batch, channel) mean → float32`.	Replaces upcast + reduce + subtract (three PyTorch passes) with one fused launch over the K-stacked input.
Tiled int32 partial-sum reduction kernel (`spatial_sum_uint8_tiled`). Warp kernel writes `(B, NUM_TILES, C)` int32 partials along H; PyTorch collapses the partial dim.	int32 accumulator is safe up to H·W·255 (128× headroom at R256) and ~3× faster than int64 / float32 on PyTorch's auto-tuned reduce path. C-innermost launch shape gives stride-1 reads on src's contiguous trailing dim.
Shared `isaaclab.utils.images.normalize_camera_image` dispatch. One helper used by both `image()` (DirectRLEnv) and `stacked_image.__call__` (ManagerBasedEnv); routes uint8 contiguous inputs through the Warp fast path and falls back to PyTorch otherwise.	Removes inline normalize math duplicated across 5 sites; the fast path picks itself up automatically.
`image(clone=False)` opt-out. Callers that immediately copy into their own storage skip the defensive clone.	Saves one obs-sized allocation per env.step at the consumer site (frame-stack buffer).
Allocator-managed obs lifetime. `_get_observations` allocates the normalize output fresh each call (no persistent `out=` buffer).	The RL trainer holds a reference to the previous-iteration observation until `record_transition` returns; reusing one buffer across env.step boundaries aliases the trainer's prior obs. Fresh allocation lets PyTorch's caching allocator hand out a different block while the prior is still referenced.

Type of change

Bug fix (non-breaking change which fixes an issue)

Results

R256 (256×256), 1024 envs, PPO via skrl 2.1.0, seed 42, on NVIDIA L40. Perf via non_rl measured over n=3 warm trials:

Stage	FPS	step_ms	Peak VRAM	Mean reward
Pre-framestack baseline	54.62	18.67	3.13 GB	~102 (unsolvable — no temporal info)
Frame stacking from #5574	31.72	32.07	8.37 GB	297.68
This PR (pre-rebase)	67.87	15.32	4.41 GB	297.60
This PR (post-rebase)	69.50	14.94	4.57 GB	293.43

Screenshots

Environment

NVIDIA L40 (48 GB), driver 570.158.01. Newton+Warp backend (presets=newton_mjwarp,newton_renderer). skrl 2.1.0 trainer.

Checklist

I have read and understood the contribution guidelines
I have run the pre-commit checks with ./isaaclab.sh --format
I have made corresponding changes to the documentation (N/A - frame stacking already documented, these changes are implementation details)
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
I have updated the changelog and the corresponding version in the extension's config/extension.toml file
I have added my name to the CONTRIBUTORS.md or my name already exists there

isaaclab-review-bot

🤖 Isaac Lab Review Bot

Summary

This PR addresses a significant 42% throughput regression introduced by frame-stacking in #5574 for the cartpole-camera task. The fix implements several well-designed optimizations: (1) storing uint8 frames in the ring buffer instead of float32 (4× smaller), (2) deferring normalization past the frame-stack buffer, (3) replacing torch.roll with memmove-style shifting, (4) adding a fused Warp kernel for RGB normalize, and (5) providing a free contiguous stacked view via the new stack_dim parameter. The result is a 24% improvement over the pre-framestack baseline (67.87 FPS vs 54.62 FPS) with identical PPO convergence.

Incremental Update (`a86c6a7`)

The author addressed review feedback in commit a86c6a7:

✅ Docstring improvements — stack_dim parameter now includes a concrete example (stack_dim=-1 on (B, H, W, C) → (B, H, W, K*C)) making the API much more intuitive.

✅ In-place subtract in normalize fallback — images.py now uses images -= torch.mean(...) instead of allocating a new tensor, addressing the memory optimization suggestion.

✅ Cache growth documented — The _uint8_sum_partials_cache comment now states it "Typically holds one entry per training run (camera resolution and device are fixed)," making the bounded growth clear.

✅ Tile size rationale — _UINT8_SUM_TILE_HW comment now documents it was "Tuned on L40; robust across modern NVIDIA arches (Ampere/Ada/Hopper) at R256," addressing the hardcoded value concern.

✅ Shift loop documented — Added "Cheap at the typical frame-stack K=2-4" comment clarifying expected usage.

✅ Minor fix — Corrected TiledCameraCfg → CameraCfg in is_rgb_like docstring.

Post-rebase Update (`0b0d0ba`)

Branch was rebased onto latest main. The rebase commit contains only mechanical cleanup — precommit formatting, variable renames aligned with upstream refactors (e.g. _default_env_origins → _default_env_pose), and changelog corrections. No logic changes to the frame-stacking or normalization code.

Test Coverage

✅ Excellent coverage — comprehensive tests for the circular buffer stack_dim mode, normalize dispatch paths, Warp kernel correctness, and a consecutive-call aliasing regression test.

Verdict

Approve — All previous review concerns were addressed. The rebase commit is mechanical cleanup only. Ready to merge.

Incremental Update (`5b3e944`)

Release-prep commit rebasing onto the v6.3.0 release train. Changes relevant to this PR:

✅ Version bumps — isaaclab 6.2.1→6.3.0, isaaclab_rl 0.5.3→0.6.0, isaaclab_tasks 2.0.1→2.0.2, isaaclab_newton 0.14.1→0.15.0, isaaclab_contrib 0.4.1→0.4.2, isaaclab_experimental 0.1.0→0.1.1. Changelogs rolled from changelog.d/ fragments into CHANGELOG.rst.

✅ Changelog fragments consumed — The per-PR .rst/.skip fragments from this PR and others merged into the release are now removed; their content lives in the main changelogs.

✅ ls_parallel deprecation landed — MJWarpSolverCfg.ls_parallel is now formally deprecated (emits DeprecationWarning, force-set to False). All task configs, tests, and docs drop the field. This is a related-but-independent cleanup that happened to land in the same batch.

✅ wrap_warp_to_torch.py removed — The obsolete migration script is deleted; a changelog entry documents the removal.

✅ CI pinning — ovphysx==0.4.13 pinned in CI workflows and pyproject.

✅ Lazy-export fix — isaaclab.envs.__init__ and isaaclab.sensors.ray_caster.__init__ switch from eager submodule imports to pure lazy_export(), matching the warp-safe pattern from the experimental package fix.

✅ Docs polish — MJWarp solver docs updated (removed ls_parallel references, clarified ls_iterations semantics), docs/_extensions/isaaclab_docs.py picks up smv_current_version.

✅ Spot velocity task — Newton config expanded with collision pipeline and shape margin settings (unrelated to this PR's camera perf work).

✅ Benchmark script — benchmark_startup.py now wraps the first env.step in torch.inference_mode().

✅ Camera demo subtitle fix — Corrected mismatched subtitles lists in scripts/demos/sensors/cameras.py.

✅ License-check workflow modernized — Switched from pip to uv (astral-sh/setup-uv).

No changes to the core frame-stacking / normalization logic in CircularBuffer, stacked_image, normalize_camera_image, or the Warp kernel code. All test files updated only to pass normalize=False/clone=True kwargs matching the new API signatures — test logic is unchanged.

Verdict

Approve — Release packaging only; the performance fix logic is untouched.

…y-up) - images.py: TiledCameraCfg -> CameraCfg in is_rgb_like docstring (deprecated class) - images.py: PyTorch normalize fallback uses in-place subtract (one fewer alloc on the cold path) - circular_buffer.py: clarified stack_dim docstring with a (B,H,W,C) example; noted the typical K=2-4 range on the append shift loop - ops.py: noted the realistic 1-entry bound on _uint8_sum_partials_cache; noted L40 tuning provenance of _UINT8_SUM_TILE_HW=32

greptile-apps · 2026-06-03T14:26:34Z

Greptile Summary

This PR rewrites the frame-stacking hot path in cartpole-camera to fix a 42% throughput regression and 2.7× VRAM inflation introduced by #5574, achieving 24% faster throughput than the pre-framestack baseline.

CircularBuffer gets a stack_dim mode that arranges internal storage so .stacked returns a free contiguous reshape of K frames merged along the configured axis, eliminating a permute + reshape + clone allocation per step; torch.roll is replaced with a front-to-back narrow().copy_() loop and a CPU-side _need_reset flag skips the torch.any GPU sync on the steady-state path.
New Warp kernels (spatial_sum_uint8_tiled + normalize_image_uint8) fuse the uint8 / 255 - per-channel mean -> float32 normalize into a single launch; the PyTorch-level partial-sum collapse uses dtype=torch.int64 to prevent int32 overflow at large resolutions.
normalize_camera_image centralises dispatch and replaces five inline normalize sites; the defer_normalize pattern holds the ring buffer in uint8 until after stacking, shrinking ring memory and per-step copy bandwidth 4x.

Confidence Score: 5/5

Safe to merge. The hot-path rewrite is well-scoped, all five affected code paths have matching unit tests, and the aliasing hazard is explicitly documented and correctly avoided by allocating a fresh float32 tensor per step.

Every key invariant is backed by a targeted test: the uint8 ring stores raw frames, consecutive normalize calls return independent storage, the delay buffer clone removal preserves independence, and the is_first_push warmup fills all K slots. The int32 partial-sum overflow concern raised on an earlier version is addressed with dtype=torch.int64 in the final collapse. No correctness issues were found.

No files require special attention.

Important Files Changed

Filename	Overview
source/isaaclab/isaaclab/utils/buffers/circular_buffer.py	Adds stack_dim mode with contiguous stacked view, replaces torch.roll with a forward memmove loop, and introduces a CPU-side _need_reset flag to gate the is_first_push GPU sync; logic is correct and well-tested.
source/isaaclab/isaaclab/utils/warp/ops.py	Adds _uint8_spatial_mean (cached partials scratch) and normalize_image_uint8 wrappers; partials.sum(dim=1, dtype=torch.int64) prevents int32 overflow at large resolutions, cache key includes channel_dim for BCHW/BHWC disambiguation.
source/isaaclab/isaaclab/utils/warp/kernels.py	Adds normalize_image_uint8 and spatial_sum_uint8_tiled Warp kernels; per-tile int32 accumulation is safe at any common camera resolution and the final collapse uses int64.
source/isaaclab/isaaclab/utils/images.py	New normalize_camera_image dispatch function; correctly routes uint8 contiguous 4D inputs to the Warp fast path, non-uint8/strided to the PyTorch fallback, depth/normals/passthrough handled consistently.
source/isaaclab/isaaclab/envs/mdp/observations.py	stacked_image uses stack_dim=-1 for a free contiguous reshape, defers normalize past the uint8 ring for RGB-like data, eliminates the permute+reshape+clone allocation.
source/isaaclab_tasks/isaaclab_tasks/core/cartpole/cartpole_direct_camera_env.py	Switches to stack_dim=1 (BCHW) and defers normalize past the uint8 ring; replaces stacked.reshape+clone with .stacked + normalize_camera_image.
source/isaaclab/isaaclab/utils/buffers/delay_buffer.py	Removes the redundant .clone() after getitem; safe because advanced indexing already returns a copy independent of ring buffer storage.

_{Reviews (2): Last reviewed commit: "Address Greptile review: int64 partials ..." | Re-trigger Greptile}

jmart-nv · 2026-06-03T16:58:25Z

Moving this back to draft while I rebase onto the cartpole refactor

…y-up) - images.py: TiledCameraCfg -> CameraCfg in is_rgb_like docstring (deprecated class) - images.py: PyTorch normalize fallback uses in-place subtract (one fewer alloc on the cold path) - circular_buffer.py: clarified stack_dim docstring with a (B,H,W,C) example; noted the typical K=2-4 range on the append shift loop - ops.py: noted the realistic 1-entry bound on _uint8_sum_partials_cache; noted L40 tuning provenance of _UINT8_SUM_TILE_HW=32

…isaac-sim#5574. - circular_buffer.py: replace torch.roll with O(1) write-index append; add CPU mirror of max_len and CPU bool gate to skip GPU sync on the steady-state path; rotate on read via the `buffer` property. - observations.py (stacked_image): drop trailing `.clone()` from the permute().reshape() chain (reshape on non-contig already allocates). - cartpole_camera_presets_env.py: same `.clone()` drop in the direct env. - delay_buffer.py: same `.clone()` drop on the DelayBuffer.compute return.

…t on append

…stack consumers

…ut reshape

Skip per-frame /255 + mean-subtract on the stacking path; apply once on the stacked output. Ring buffer now holds native uint8 (4x cheaper per-step copies). Math is identical because K frames live in disjoint channel slices. Affects stacked_image MDP term and the cartpole DirectRLEnv parent (new `normalize: bool = True` kwarg) + subclass.

Replace the three-pass PyTorch normalize on the camera-observation hot path with a single fused Warp kernel + small mean reduction. Consolidate the per-data-type dispatch into a shared helper used by image(), stacked_image, CartpoleCameraEnv, and CartpoleCameraPresetsEnv. Frame-stacking subclasses pre-allocate the float32 output once and reuse it across steps, eliminating the per-step transient that the previous deferred-normalize path incurred.

The Warp kernel produces int32 partial sums along H; PyTorch collapses the partials and divides once. Channels-innermost launch shape gives coalesced reads on src's contiguous trailing dim. Partials scratch is cached per (shape, device).

- stacked_image / CartpoleCameraPresetsEnv: don't reuse a pre-allocated normalize-output buffer across env.step; the previous-iteration obs (held by the RL trainer) was overwritten before record_transition could read it, causing PPO to record degenerate transitions - normalize_image_uint8: docstring warning on the out= aliasing hazard - CircularBuffer.reset: setitem (= 0.0) instead of advanced-getitem + .zero_() so partial-batch resets actually zero the storage - CircularBuffer.stacked: drop unreachable empty-buffer guard - _uint8_sum_partials_cache: store the tensor directly, not a tuple - Tests: regressions for both bugs, image() clone kwarg, and DelayBuffer.compute() no-aliasing contract

…y-up) - images.py: TiledCameraCfg -> CameraCfg in is_rgb_like docstring (deprecated class) - images.py: PyTorch normalize fallback uses in-place subtract (one fewer alloc on the cold path) - circular_buffer.py: clarified stack_dim docstring with a (B,H,W,C) example; noted the typical K=2-4 range on the append shift loop - ops.py: noted the realistic 1-entry bound on _uint8_sum_partials_cache; noted L40 tuning provenance of _UINT8_SUM_TILE_HW=32

…g corrections

…uffer.stacked error

jmart-nv · 2026-06-04T21:10:02Z

After rebasing and refactoring to support the changes in #5930 which moved the channel dim to BCHW instead of BHWC.

FPS shows further improvement at 69 fps vs 54 fps baseline
VRAM shows marginal regression, up to 4.57 GB post-rebase vs 4.41GB pre-rebase (due to extra permute() each step)
Still converges around ~295 reward

Also addressed @greptile's concerns and profiled the fixes.

Updated description with latest numbers and plot. Moving this back to PR.

greptile-apps · 2026-06-04T21:11:15Z

The circular_buffer.py fix was cleanly applied — the .stacked property now correctly distinguishes the two None cases exactly as suggested. Thanks for that.

Regarding the _get_observations(clone, normalize) concern, since both comments went stale on the rebase, could you share whether the optimization params were factored out of the public signature, or did the rebase restructure things in a way that made the concern moot (e.g., the BCHW refactor changed how _get_observations is overridden)?

The perf numbers look excellent — 69.5 FPS vs 54.6 baseline is a meaningful win, and the VRAM regression from the extra permute() post-rebase is a reasonable trade for BCHW compatibility.

Tip: You can customize Greptile's behavior for this repo with .greptile/rules.md and .greptile/config.json.

…cstring

isaaclab-review-bot · 2026-06-05T15:29:12Z

    The shape of the appended data is expected to be (batch_size, ...), where the first dimension is the
    batch dimension. Correspondingly, the shape of the ring buffer is (max_len, batch_size, ...).
+
+    When ``stack_dim`` is set, the internal layout is rearranged so that :attr:`stacked`


Re-review (4df4101): New commit is a minor docstring fix (:paramref: → inline code formatting in circular_buffer.py). No new issues introduced.

✅ Previous P1 concern (misleading RuntimeError) was already addressed.
ℹ️ Previous P2 suggestion (factoring _get_observations) remains open as a non-blocking style note.

LGTM — no further action needed from my side.

github-actions Bot added the isaac-lab Related to Isaac Lab team label May 28, 2026

isaaclab-review-bot Bot reviewed May 28, 2026

View reviewed changes

pbarejko reviewed May 29, 2026

View reviewed changes

Comment thread source/isaaclab/isaaclab/utils/images.py Outdated

Comment thread source/isaaclab/isaaclab/utils/buffers/circular_buffer.py

Comment thread source/isaaclab/isaaclab/utils/buffers/circular_buffer.py

pbarejko approved these changes Jun 2, 2026

View reviewed changes

jmart-nv marked this pull request as ready for review June 3, 2026 14:05

jmart-nv requested review from jtigue-bdai, kellyguo11, ooctipus and pascal-roth as code owners June 3, 2026 14:05

greptile-apps Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread source/isaaclab/isaaclab/utils/buffers/circular_buffer.py Outdated

Comment thread source/isaaclab_tasks/isaaclab_tasks/direct/cartpole/cartpole_camera_env.py Outdated

jmart-nv marked this pull request as draft June 3, 2026 16:58

jmart-nv force-pushed the jmart/framestack-perf branch from a86c6a7 to 0b0d0ba Compare June 3, 2026 21:30

jmart-nv added 13 commits June 4, 2026 16:03

Drop read-side torch.roll from CircularBuffer.buffer via memmove-shif…

de05064

…t on append

Add CircularBuffer.copy_to_stacked and clone=False opt-out for frame-…

2547274

…stack consumers

Add stack_dim mode to CircularBuffer for free contiguous stacked-outp…

1f8325a

…ut reshape

Changelog fragments for framestack performance regression fix

342ff92

Restore original comments that were accidentally trimmed

c5023ce

Post-rebase cleanup: precommit formatting, variable renames, changelo…

9441d0d

…g corrections

Address Greptile review: int64 partials reduction + clearer CircularB…

5b3e944

…uffer.stacked error

jmart-nv force-pushed the jmart/framestack-perf branch from 0b0d0ba to 5b3e944 Compare June 4, 2026 21:08

jmart-nv marked this pull request as ready for review June 4, 2026 21:10

Fix docs build: drop unsupported :paramref: role in CircularBuffer do…

4df4101

…cstring

isaaclab-review-bot Bot reviewed Jun 5, 2026

View reviewed changes

pbarejko merged commit 0420e17 into isaac-sim:develop Jun 5, 2026
36 of 37 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix cartpole-camera frame-stacking performance regression#5849

Fix cartpole-camera frame-stacking performance regression#5849
pbarejko merged 14 commits into
isaac-sim:developfrom
jmart-nv:jmart/framestack-perf

jmart-nv commented May 28, 2026 •

edited

Loading

Uh oh!

isaaclab-review-bot Bot left a comment •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot commented Jun 3, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

jmart-nv commented Jun 3, 2026

Uh oh!

jmart-nv commented Jun 4, 2026

Uh oh!

greptile-apps Bot commented Jun 4, 2026

Uh oh!

isaaclab-review-bot Bot Jun 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jmart-nv commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Type of change

Results

Screenshots

Environment

Checklist

Uh oh!

isaaclab-review-bot Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

🤖 Isaac Lab Review Bot

Summary

Incremental Update (a86c6a7)

Post-rebase Update (0b0d0ba)

Test Coverage

Verdict

Incremental Update (5b3e944)

Verdict

Uh oh!

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Uh oh!

Uh oh!

Uh oh!

jmart-nv commented Jun 3, 2026

Uh oh!

jmart-nv commented Jun 4, 2026

Uh oh!

greptile-apps Bot commented Jun 4, 2026

Uh oh!

isaaclab-review-bot Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jmart-nv commented May 28, 2026 •

edited

Loading

isaaclab-review-bot Bot left a comment •

edited

Loading

Incremental Update (`a86c6a7`)

Post-rebase Update (`0b0d0ba`)

Incremental Update (`5b3e944`)

greptile-apps Bot commented Jun 3, 2026 •

edited

Loading